library(dplyr)
library(magrittr)
library(ggplot2)
library(coefplot)
## find the project root
root <- rprojroot::find_rstudio_root_file()
## set the data directory path
dataDir <- file.path(root, 'Data')
We will first read data from Jared Lander’s website. The data comes from the NYC Open Data portal but has already been cleaned. It represents housing sales in NYC between 2011 and 2021.
housing <- readr::read_csv("http://www.jaredlander.com/data/housing.csv")
Let’s make the names more readable:
names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt",
"SqFt", "Income", "IncomePerSqFt", "Expense",
"ExpensePerSqFt", "NetIncome", "Value",
"ValuePerSqFt", "Boro")
One measure of the relationship between two variables is called correlation. Correlation ranges between -1 and 1, with negative values meaning an inverse relationship and positive values meaning a direct relationship.
cor(housing$SqFt, housing$Income)
## [1] 0.8560584
We can also see the correlation between multiple variables at once.
cor(housing %>% select(c(Income, SqFt, Units, ValuePerSqFt)))
## Income SqFt Units ValuePerSqFt
## Income 1.0000000 0.8560584 0.7557462 0.4449276
## SqFt 0.8560584 1.0000000 0.9577540 0.2148611
## Units 0.7557462 0.9577540 1.0000000 0.1435213
## ValuePerSqFt 0.4449276 0.2148611 0.1435213 1.0000000
We have already seen this kind of plot; let’s plot our new data with ggpairs.
GGally::ggpairs(housing %>% select(c(Income, SqFt, Units, ValuePerSqFt)))
Related to correlation, covariance is like a variance between variables: it measures how strongly two or more sets of random variates move together. Unlike correlation, which is normalized by the standard deviations, covariance values can fall outside of -1 and 1.
cov(housing$Income, housing$ValuePerSqFt)
## [1] 141382416
Like correlation, we can compare the covariance between multiple variables.
cov(housing %>% select(c(Income, SqFt, Units, ValuePerSqFt)))
## Income SqFt Units ValuePerSqFt
## Income 2.149034e+13 571556408169 4.734186e+08 1.413824e+08
## SqFt 5.715564e+11 20742831383 1.863956e+07 2.121173e+06
## Units 4.734186e+08 18639557 1.825977e+04 1.329378e+03
## ValuePerSqFt 1.413824e+08 2121173 1.329378e+03 4.698603e+03
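As a quick sanity check of the relationship between the two measures (a sketch, assuming the housing data is loaded as above), dividing the covariance by the product of the standard deviations recovers the correlation:

```r
# correlation = covariance scaled by both standard deviations
manualCor <- cov(housing$Income, housing$ValuePerSqFt) /
  (sd(housing$Income) * sd(housing$ValuePerSqFt))
all.equal(manualCor, cor(housing$Income, housing$ValuePerSqFt))
## [1] TRUE
```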
Let’s take a look at square feet and possible rental income.
housing %>% select(Income, SqFt) %>% head
## # A tibble: 6 x 2
## Income SqFt
## <dbl> <dbl>
## 1 1332615 36500
## 2 6633257 126420
## 3 17310000 554174
## 4 11776313 249076
## 5 10004582 219495
## 6 5127687 139719
And we can plot the two variables and draw the regression line.
ggplot(housing, aes(x = SqFt, y = Income)) + geom_point(alpha = 1/6, shape = 1) + geom_smooth(method = 'lm') + theme_minimal() +
xlab("Square Feet") + ylab("Income")
Let’s actually fit the model using the lm function.
We are using formula notation. On the left of the ~ we put our response variable (Income), and on the right we put our one predictor variable (SqFt).
The results show two coefficients: one for the intercept and one for the slope of SqFt.
sqftLM <- lm(Income ~ SqFt, data = housing)
sqftLM
##
## Call:
## lm(formula = Income ~ SqFt, data = housing)
##
## Coefficients:
## (Intercept) SqFt
## 360400.29 27.55
sqftInfo <- summary(sqftLM)
sqftInfo
##
## Call:
## lm(formula = Income ~ SqFt, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55527201 -585422 -408030 99571 28392555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.604e+05 5.394e+04 6.681 2.88e-11 ***
## SqFt 2.755e+01 3.248e-01 84.839 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2397000 on 2624 degrees of freedom
## Multiple R-squared: 0.7328, Adjusted R-squared: 0.7327
## F-statistic: 7198 on 1 and 2624 DF, p-value: < 2.2e-16
## MSE
sum(residuals(sqftLM)^2) / df.residual(sqftLM)
## [1] 5.743635e+12
## RMSE
sqrt(sum(residuals(sqftLM)^2) / df.residual(sqftLM))
## [1] 2396588
sqftCoef <- as.data.frame(sqftInfo$coefficients[,1:2])
sqftCoef <- within(sqftCoef, {
Lower <- Estimate - qt(p=.9, df = sqftInfo$df[2]) * `Std. Error`
Upper <- Estimate + qt(p=.9, df = sqftInfo$df[2]) * `Std. Error`
coef <- row.names(sqftCoef)
})
ggplot(sqftCoef, aes(x=Estimate, y=coef)) + geom_point() + geom_errorbarh(aes(xmin=Lower, xmax = Upper), height = .3) +
ggtitle("Rental Comparable Income by Square Feet")
We can also compute the percentage of variance explained by hand.
# R^2
1 - sum((housing$Income-predict.lm(sqftLM, data.frame(SqFt =housing$SqFt)))^2)/
sum((housing$Income-mean(housing$Income))^2)
## [1] 0.732836
If we want to see the outliers, the car package has some nice tools. The function outlierTest reports the studentized residuals, which highlight the extreme outliers in our data. \[t_i = r_i \left(\frac{n-k-2}{n-k-1-r_i^2}\right)^{1/2}\] Here \(r_i\) is the standardized residual, \(n\) the number of observations, and \(k\) the number of predictors.
In the output below, observation 2569 has a studentized residual of -29.98; absolute values larger than 3 are commonly flagged as outliers, and such points can strongly affect the regression fit.
car::outlierTest(sqftLM)
## rstudent unadjusted p-value Bonferonni p
## 2569 -29.975974 5.0079e-170 1.3151e-166
## 2568 -20.379881 7.9436e-86 2.0860e-82
## 2567 -13.794873 7.6305e-42 2.0038e-38
## 932 12.243080 1.4805e-33 3.8878e-30
## 732 8.857920 1.4706e-18 3.8618e-15
## 1129 6.169128 7.9291e-10 2.0822e-06
## 2022 -5.978266 2.5615e-09 6.7265e-06
## 890 5.706572 1.2821e-08 3.3667e-05
## 735 5.531466 3.4896e-08 9.1636e-05
## 736 5.527662 3.5652e-08 9.3621e-05
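The formula above can be verified directly (a sketch, assuming sqftLM is the model fit earlier): starting from the internally studentized residuals from rstandard and applying the formula with k = 1 predictor reproduces rstudent, which is what outlierTest reports.

```r
n <- nrow(model.frame(sqftLM))
k <- 1                        # one predictor: SqFt
r <- rstandard(sqftLM)        # internally studentized residuals
tManual <- r * sqrt((n - k - 2) / (n - k - 1 - r^2))
all.equal(unname(tManual), unname(rstudent(sqftLM)))
## [1] TRUE
```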
car::leveragePlots(sqftLM)
The influencePlot function creates a “bubble” plot of studentized residuals versus hat values, with the area of each circle proportional to that observation’s Cook’s distance. Vertical reference lines are drawn at twice and three times the average hat value, and horizontal reference lines at -2, 0, and 2 on the studentized-residual scale.
cutoff <- 4/((nrow(housing)-length(sqftLM$coefficients)-2))
plot(sqftLM, which=4, cook.levels=cutoff)
car::influencePlot(sqftLM, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
## StudRes Hat CookD
## 2568 -20.37988 0.1284836 26.44070
## 2569 -29.97597 0.1982310 82.76901
The qqPlot function plots empirical quantiles of a variable, or of studentized residuals from a linear model, against theoretical quantiles of a comparison distribution.
car::qqPlot(sqftLM, main="QQ Plot")
## [1] 2568 2569
sresid <- MASS::studres(sqftLM)
data.frame(sresid) %>% ggplot(aes(x =sresid)) +
geom_histogram(binwidth = 0.5) + theme_minimal()
The spreadLevelPlot function creates plots for examining the possible dependence of spread on level, or an extension of these plots to the studentized residuals from linear models. We’re interested in whether the variability of a variable is unequal across the range of values of a second variable that predicts it.
# Evaluate homoscedasticity
# non-constant error variance test
car::ncvTest(sqftLM)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 114180.4, Df = 1, p = < 2.22e-16
# plot studentized residuals vs. fitted values
car::spreadLevelPlot(sqftLM)
##
## Suggested power transformation: 0.2425346
crPlots constructs component+residual plots, also called partial-residual plots, for linear and generalized linear models.
This is a good way to see whether the predictors have a linear relationship with the dependent variable. A partial-residual plot models the residuals of one predictor against the dependent variable, and a component line indicates where the line of best fit lies. A significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable.
car::crPlots(sqftLM)
The Durbin-Watson statistic is used to detect the presence of autocorrelation at lag 1 in the residuals (prediction errors) from a regression analysis. Remember that we want our errors to be independent. The Durbin-Watson statistic always has a value between 0 and 4. A value of 2.0 means that no autocorrelation is detected in the sample; values from 0 to 2 indicate positive autocorrelation, and values from 2 to 4 indicate negative autocorrelation.
Positive autocorrelation means that if the previous value declines, the next value will also tend to decline, while negative autocorrelation has the opposite effect.
# Test for Autocorrelated Errors
car::durbinWatsonTest(sqftLM)
## lag Autocorrelation D-W Statistic p-value
## 1 0.5187698 0.9621784 0
## Alternative hypothesis: rho != 0
layout(matrix(c(1,2,3,4),2,2))
plot(sqftLM)
What happens when you want to include more than one predictor variable? The model becomes \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \text{Error}\]
You have to make sure all the relevant variables are in the model, or else you will have a systematic error when we want our error to be random.
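A small simulation makes this concrete (purely illustrative; the variable names are made up): if x2 drives y and is correlated with x1, leaving x2 out of the model biases the coefficient on x1.

```r
set.seed(42)
x1 <- rnorm(1000)
x2 <- 0.8 * x1 + rnorm(1000)             # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(1000)
coef(lm(y ~ x1 + x2))["x1"]              # close to the true value of 2
coef(lm(y ~ x1))["x1"]                   # biased upward, toward 2 + 3 * 0.8
```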
To fit such a multiple regression we’ll first have to explore our data. Let’s look at the distribution of ValuePerSqFt.
ggplot(housing, aes(x = ValuePerSqFt)) + geom_histogram()
There seems to be a bimodal distribution with peaks at around 80 and 200. We can explore further by mapping fill color to borough. We can see that Manhattan and Brooklyn account for the two peaks, each exhibiting a different distribution.
ggplot(housing, aes(x = ValuePerSqFt, fill = Boro)) + geom_histogram(binwidth = 10) +
facet_wrap(~ Boro)
How about other variables? What do square footage and units look like?
We can see that for square footage, there are some buildings with extremely large area. The number of units follows a similar distribution: most of the sales are in the lower ranges, with a few outliers at large values.
ggplot(housing, aes(SqFt)) + geom_histogram()
ggplot(housing, aes(Units)) + geom_histogram()
If we plot SqFt and Units against ValuePerSqFt, we can see if we can trim some outliers.
ggplot(housing, aes(SqFt, ValuePerSqFt)) + geom_point()
ggplot(housing, aes(Units, ValuePerSqFt)) + geom_point()
Based on the previous graphs, we can trim the data to units below 1,000.
ggplot(housing %>% filter(Units < 1000), aes(Units, ValuePerSqFt)) + geom_point()
housing <- housing %>% filter(Units < 1000)
Now let’s fit a model condo1 with our response variable being ValuePerSqFt and predictor variables Units, SqFt, and Boro.
housingFact <- housing %>% mutate(Boro = as.factor(Boro))
condo1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = housingFact)
We can view the summary of the model
summary(condo1)
##
## Call:
## lm(formula = ValuePerSqFt ~ Units + SqFt + Boro, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.458 -22.680 1.493 26.290 261.761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.430e+01 5.342e+00 8.293 < 2e-16 ***
## Units -1.532e-01 2.421e-02 -6.330 2.88e-10 ***
## SqFt 2.070e-04 2.129e-05 9.723 < 2e-16 ***
## BoroBrooklyn 3.258e+01 5.561e+00 5.858 5.28e-09 ***
## BoroManhattan 1.274e+02 5.459e+00 23.343 < 2e-16 ***
## BoroQueens 3.011e+01 5.711e+00 5.272 1.46e-07 ***
## BoroStaten Island -7.114e+00 1.001e+01 -0.711 0.477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.2 on 2613 degrees of freedom
## Multiple R-squared: 0.6034, Adjusted R-squared: 0.6025
## F-statistic: 662.6 on 6 and 2613 DF, p-value: < 2.2e-16
and view only the coefficients
condo1$coefficients
## (Intercept) Units SqFt BoroBrooklyn
## 4.430325e+01 -1.532405e-01 2.069727e-04 3.257554e+01
## BoroManhattan BoroQueens BoroStaten Island
## 1.274259e+02 3.011000e+01 -7.113688e+00
With the coefplot library we can view the coefficients of our model.
library(coefplot)
coefplot(condo1, sort = 'mag')
We can now see what difference scaling Units and SqFt makes for our model.
condo2 <- lm(ValuePerSqFt ~ scale(Units) + scale(SqFt) + Boro, data = housing)
summary(condo2)
##
## Call:
## lm(formula = ValuePerSqFt ~ scale(Units) + scale(SqFt) + Boro,
## data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.458 -22.680 1.493 26.290 261.761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.459 5.320 9.484 < 2e-16 ***
## scale(Units) -14.142 2.234 -6.330 2.88e-10 ***
## scale(SqFt) 21.949 2.257 9.723 < 2e-16 ***
## BoroBrooklyn 32.576 5.561 5.858 5.28e-09 ***
## BoroManhattan 127.426 5.459 23.343 < 2e-16 ***
## BoroQueens 30.110 5.711 5.272 1.46e-07 ***
## BoroStaten Island -7.114 10.008 -0.711 0.477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.2 on 2613 degrees of freedom
## Multiple R-squared: 0.6034, Adjusted R-squared: 0.6025
## F-statistic: 662.6 on 6 and 2613 DF, p-value: < 2.2e-16
coefplot(condo2, sort = 'mag')
multiplot(condo1, condo2)
Sometimes two variables interact and are important features to capture. In R, you use either * or : to create unique combinations of predictor variables on the dependent variable. The difference between the two is that * keeps the original terms, while : keeps only the interaction terms and not the original terms.
condo2 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + YearBuilt, data = housingFact)
summary(condo2)
##
## Call:
## lm(formula = ValuePerSqFt ~ Units + SqFt * Boro + YearBuilt,
## data = housingFact)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176.87 -20.38 0.94 24.04 253.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.890e+02 4.645e+01 -12.681 < 2e-16 ***
## Units -1.391e-01 2.393e-02 -5.812 6.94e-09 ***
## SqFt 9.070e-05 9.442e-05 0.961 0.336840
## BoroBrooklyn 3.041e+01 7.809e+00 3.895 0.000101 ***
## BoroManhattan 1.319e+02 7.766e+00 16.982 < 2e-16 ***
## BoroQueens 2.957e+01 7.982e+00 3.705 0.000216 ***
## BoroStaten Island -6.308e+00 1.685e+01 -0.374 0.708084
## YearBuilt 3.220e-01 2.307e-02 13.960 < 2e-16 ***
## SqFt:BoroBrooklyn 2.345e-05 9.606e-05 0.244 0.807153
## SqFt:BoroManhattan 1.028e-04 9.268e-05 1.109 0.267565
## SqFt:BoroQueens 1.619e-05 9.613e-05 0.168 0.866281
## SqFt:BoroStaten Island 3.095e-05 1.756e-04 0.176 0.860101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.88 on 2512 degrees of freedom
## (96 observations deleted due to missingness)
## Multiple R-squared: 0.6317, Adjusted R-squared: 0.6301
## F-statistic: 391.7 on 11 and 2512 DF, p-value: < 2.2e-16
coefplot(condo2, sort = 'mag', coefficients = condo2$coefficients[-1] %>% names())
Here is an example using the : to remove the original variables and keeping only the interactions.
condo3 <- lm(ValuePerSqFt ~ Units + SqFt:Boro + YearBuilt, data = housingFact)
summary(condo3)
##
## Call:
## lm(formula = ValuePerSqFt ~ Units + SqFt:Boro + YearBuilt, data = housingFact)
##
## Residuals:
## Min 1Q Median 3Q Max
## -250.47 -40.83 -10.74 39.10 263.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.364e+02 5.485e+01 6.134 9.92e-10 ***
## Units -1.325e-01 3.195e-02 -4.147 3.48e-05 ***
## YearBuilt -1.076e-01 2.790e-02 -3.855 0.000119 ***
## SqFt:BoroBronx -5.620e-04 8.922e-05 -6.299 3.52e-10 ***
## SqFt:BoroBrooklyn -2.088e-04 4.069e-05 -5.131 3.09e-07 ***
## SqFt:BoroManhattan 3.913e-04 2.744e-05 14.257 < 2e-16 ***
## SqFt:BoroQueens -2.096e-04 4.355e-05 -4.812 1.58e-06 ***
## SqFt:BoroStaten Island -5.489e-04 1.180e-04 -4.654 3.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.91 on 2516 degrees of freedom
## (96 observations deleted due to missingness)
## Multiple R-squared: 0.3426, Adjusted R-squared: 0.3408
## F-statistic: 187.3 on 7 and 2516 DF, p-value: < 2.2e-16
coefplot(condo3, sort = 'mag', coefficients = condo3$coefficients[-1] %>% names())
Here is an example of adding more interaction terms.
condo4 <- lm(ValuePerSqFt ~ Units + SqFt*Boro + SqFt*Class + YearBuilt, data = housingFact)
summary(condo4)
##
## Call:
## lm(formula = ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class +
## YearBuilt, data = housingFact)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.699 -20.178 -1.007 23.106 249.155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.741e+02 4.645e+01 -12.360 < 2e-16 ***
## Units -7.507e-02 2.427e-02 -3.093 0.002005 **
## SqFt -3.817e-05 1.175e-04 -0.325 0.745261
## BoroBrooklyn 2.642e+01 7.642e+00 3.458 0.000554 ***
## BoroManhattan 1.252e+02 7.721e+00 16.212 < 2e-16 ***
## BoroQueens 2.414e+01 7.817e+00 3.089 0.002032 **
## BoroStaten Island -1.553e+01 1.648e+01 -0.943 0.346003
## ClassR4-CONDOMINIUM 1.242e+01 3.050e+00 4.073 4.78e-05 ***
## ClassR9-CONDOMINIUM 1.281e+01 4.808e+00 2.664 0.007777 **
## ClassRR-CONDOMINIUM -4.421e+01 8.150e+00 -5.424 6.38e-08 ***
## YearBuilt 3.128e-01 2.314e-02 13.515 < 2e-16 ***
## SqFt:BoroBrooklyn 1.897e-05 9.400e-05 0.202 0.840089
## SqFt:BoroManhattan 1.288e-04 9.096e-05 1.416 0.156966
## SqFt:BoroQueens 3.153e-05 9.397e-05 0.335 0.737295
## SqFt:BoroStaten Island 2.813e-05 1.714e-04 0.164 0.869688
## SqFt:ClassR4-CONDOMINIUM 7.053e-05 7.566e-05 0.932 0.351322
## SqFt:ClassR9-CONDOMINIUM -5.208e-05 7.810e-05 -0.667 0.504928
## SqFt:ClassRR-CONDOMINIUM 1.578e-04 8.514e-05 1.853 0.064040 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.75 on 2506 degrees of freedom
## (96 observations deleted due to missingness)
## Multiple R-squared: 0.6521, Adjusted R-squared: 0.6498
## F-statistic: 276.3 on 17 and 2506 DF, p-value: < 2.2e-16
coefplot(condo4, sort = 'mag', coefficients = condo4$coefficients[-1] %>% names())
Stepwise regression iteratively adds or removes predictors in order to find the best subset of variables, resulting in the best model. The criterion for choosing the best model is usually the lowest prediction error, often measured by RMSE.
There are three ways stepwise selection can be operationalized:
Forward selection starts with no predictors and iteratively adds the most predictive variable. It stops when the improvement is no longer statistically significant.
Backward selection starts with all the predictors and iteratively removes the least predictive variable. This again stops when the model no longer improves.
Stepwise (both) is a combination of forward and backward selection. You start with no variables as in the forward method, but you also remove any variables that no longer provide improvement, as in backward selection.
Some view this method as mere data dredging and a poor substitute for real subject-matter expertise.
Another commonly used metric is the AIC, an estimator of the relative quality of statistical models for a given set of data. It is defined as \[\text{AIC} = 2k-2\ln(\hat{L})\] where \(\hat{L}\) is the maximum value of the likelihood function for the model and \(k\) is the number of estimated parameters. The model with the lowest AIC value is preferred: AIC rewards goodness of fit through the likelihood while penalizing an increasing number of parameters. In a simple linear regression there are three parameters, \(\beta_0\) (the intercept), \(\beta_1\) (the coefficient), and the variance of the Gaussian errors, which counts as one of the estimated parameters.
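We can verify the formula on our simple model (a sketch, assuming sqftLM from earlier): with k = 3 estimated parameters, 2k minus twice the log-likelihood matches what R’s AIC function reports.

```r
k <- 3                                    # intercept, slope, residual variance
manualAIC <- 2 * k - 2 * as.numeric(logLik(sqftLM))
all.equal(manualAIC, AIC(sqftLM))
## [1] TRUE
```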
Here is an example of running through the different permutations of each step.
nullModel <- lm(ValuePerSqFt ~ 1, data = housingFact %>% filter(complete.cases(.)))
step(nullModel,
scope = list(lower = nullModel, upper = condo4),
direction = "both")
## Start: AIC=21364.55
## ValuePerSqFt ~ 1
##
## Df Sum of Sq RSS AIC
## + Boro 4 6878123 5085328 19213
## + SqFt 1 1236641 10726810 21091
## + Class 3 1247951 10715500 21092
## + Units 1 716484 11246967 21211
## + YearBuilt 1 113894 11849557 21342
## <none> 11963451 21364
##
## Step: AIC=19213.27
## ValuePerSqFt ~ Boro
##
## Df Sum of Sq RSS AIC
## + YearBuilt 1 475055 4610273 18968
## + Class 3 250325 4835003 19092
## + SqFt 1 188347 4896981 19120
## + Units 1 85383 4999945 19172
## <none> 5085328 19213
## - Boro 4 6878123 11963451 21364
##
## Step: AIC=18967.74
## ValuePerSqFt ~ Boro + YearBuilt
##
## Df Sum of Sq RSS AIC
## + Class 3 201095 4409177 18861
## + SqFt 1 105177 4505096 18912
## + Units 1 36715 4573557 18950
## <none> 4610273 18968
## - YearBuilt 1 475055 5085328 19213
## - Boro 4 7239284 11849557 21342
##
## Step: AIC=18861.17
## ValuePerSqFt ~ Boro + YearBuilt + Class
##
## Df Sum of Sq RSS AIC
## + SqFt 1 110040 4299137 18799
## + Units 1 51735 4357443 18833
## <none> 4409177 18861
## - Class 3 201095 4610273 18968
## - YearBuilt 1 425825 4835003 19092
## - Boro 4 6165631 10574808 21061
##
## Step: AIC=18799.38
## ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt
##
## Df Sum of Sq RSS AIC
## + SqFt:Class 3 73770 4225367 18762
## + SqFt:Boro 4 52406 4246731 18776
## + Units 1 37092 4262045 18780
## <none> 4299137 18799
## - SqFt 1 110040 4409177 18861
## - Class 3 205959 4505096 18912
## - YearBuilt 1 354111 4653248 18997
## - Boro 4 5449763 9748900 20858
##
## Step: AIC=18761.69
## ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt + Class:SqFt
##
## Df Sum of Sq RSS AIC
## + SqFt:Boro 4 47787 4177580 18741
## + Units 1 22666 4202701 18750
## <none> 4225367 18762
## - Class:SqFt 3 73770 4299137 18799
## - YearBuilt 1 358309 4583676 18965
## - Boro 4 5356184 9581550 20820
##
## Step: AIC=18740.99
## ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt + Class:SqFt +
## Boro:SqFt
##
## Df Sum of Sq RSS AIC
## + Units 1 15885 4161695 18733
## <none> 4177580 18741
## - Boro:SqFt 4 47787 4225367 18762
## - Class:SqFt 3 69151 4246731 18776
## - YearBuilt 1 300373 4477952 18914
##
## Step: AIC=18733.37
## ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt + Units + Class:SqFt +
## Boro:SqFt
##
## Df Sum of Sq RSS AIC
## <none> 4161695 18733
## - Units 1 15885 4177580 18741
## - Boro:SqFt 4 41006 4202701 18750
## - Class:SqFt 3 57437 4219132 18762
## - YearBuilt 1 303327 4465022 18909
##
## Call:
## lm(formula = ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt +
## Units + Class:SqFt + Boro:SqFt, data = housingFact %>% filter(complete.cases(.)))
##
## Coefficients:
## (Intercept) BoroBrooklyn
## -5.741e+02 2.642e+01
## BoroManhattan BoroQueens
## 1.252e+02 2.414e+01
## BoroStaten Island YearBuilt
## -1.553e+01 3.128e-01
## ClassR4-CONDOMINIUM ClassR9-CONDOMINIUM
## 1.242e+01 1.281e+01
## ClassRR-CONDOMINIUM SqFt
## -4.421e+01 -3.817e-05
## Units ClassR4-CONDOMINIUM:SqFt
## -7.507e-02 7.053e-05
## ClassR9-CONDOMINIUM:SqFt ClassRR-CONDOMINIUM:SqFt
## -5.208e-05 1.577e-04
## BoroBrooklyn:SqFt BoroManhattan:SqFt
## 1.897e-05 1.288e-04
## BoroQueens:SqFt BoroStaten Island:SqFt
## 3.153e-05 2.813e-05
The relaimpo library shows us the relative importance of each variable in our original model.
library(relaimpo)
crlm <- calc.relimp(condo1,
type = c("lmg", "last", "first"),
rela = TRUE )
plot(crlm)
Here is another example using the mpg data from ggplot2.
mpg <- mpg %>% mutate_if(is.character, ~as.factor(.x))
carsLm <- lm(hwy ~ displ + year + cyl + trans + drv + fl + class, data = mpg)
stepMod <- MASS::stepAIC(carsLm, direction = "both")
## Start: AIC=312.83
## hwy ~ displ + year + cyl + trans + drv + fl + class
##
## Df Sum of Sq RSS AIC
## - trans 9 21.02 740.50 301.57
## <none> 719.48 312.83
## - displ 1 30.02 749.50 320.40
## - cyl 1 48.82 768.30 326.19
## - drv 2 114.03 833.51 343.26
## - year 1 109.35 828.83 343.94
## - class 6 420.05 1139.53 408.43
## - fl 4 569.49 1288.98 441.27
##
## Step: AIC=301.57
## hwy ~ displ + year + cyl + drv + fl + class
##
## Df Sum of Sq RSS AIC
## <none> 740.50 301.57
## - displ 1 33.98 774.48 310.07
## + trans 9 21.02 719.48 312.83
## - cyl 1 54.72 795.22 316.25
## - drv 2 127.95 868.46 334.87
## - year 1 139.32 879.82 339.91
## - class 6 474.55 1215.06 405.45
## - fl 4 580.57 1321.08 429.03
stepMod$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## hwy ~ displ + year + cyl + trans + drv + fl + class
##
## Final Model:
## hwy ~ displ + year + cyl + drv + fl + class
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 209 719.4808 312.8309
## 2 - trans 9 21.02381 218 740.5047 301.5705
carsim <- calc.relimp(carsLm,
type = c("lmg", "last", "first"),
rela = TRUE )
plot(carsim)
The glm function can fit many types of generalized linear models by specifying a family, for example gaussian (linear regression), binomial (logistic regression), and poisson (count data). Here we use a binomial family with a logit link to model which borough a building is in.
housingBoro <- housing %>% filter(Boro %in% c("Manhattan", "Brooklyn")) %>%
mutate(Boro = as.factor(Boro)) %>%
dplyr::select(-c("Neighborhood", "Class"))
Boro1 <- glm(Boro ~. , data = housingBoro,
family = binomial(link="logit"))
summary(Boro1)
##
## Call:
## glm(formula = Boro ~ ., family = binomial(link = "logit"), data = housingBoro)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6332 -0.2926 0.0409 0.1444 3.7449
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.720e+01 4.341e+00 8.569 < 2e-16 ***
## Units 3.902e-03 4.320e-03 0.903 0.3664
## YearBuilt -2.334e-02 2.182e-03 -10.700 < 2e-16 ***
## SqFt 1.765e-06 7.084e-06 0.249 0.8032
## Income 5.480e-07 1.120e-06 0.489 0.6246
## IncomePerSqFt 5.163e-01 8.602e-02 6.002 1.95e-09 ***
## Expense -1.451e-07 1.461e-06 -0.099 0.9209
## ExpensePerSqFt -2.741e-01 1.203e-01 -2.279 0.0227 *
## NetIncome NA NA NA NA
## Value -1.127e-07 1.428e-07 -0.789 0.4301
## ValuePerSqFt -1.112e-02 1.042e-02 -1.068 0.2856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2487.42 on 2001 degrees of freedom
## Residual deviance: 749.86 on 1992 degrees of freedom
## (92 observations deleted due to missingness)
## AIC: 769.86
##
## Number of Fisher Scoring iterations: 8
coefplot(Boro1, sort = 'mag',coefficients = names(Boro1$coefficients)[-1])
Boro2 <- glm(Boro ~IncomePerSqFt , data = housingBoro,
family = binomial)
summary(Boro2)
##
## Call:
## glm(formula = Boro ~ IncomePerSqFt, family = binomial, data = housingBoro)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0485 -0.3954 0.0364 0.1868 3.8703
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.84115 0.47461 -20.73 <2e-16 ***
## IncomePerSqFt 0.41190 0.02009 20.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2691.29 on 2093 degrees of freedom
## Residual deviance: 976.37 on 2092 degrees of freedom
## AIC: 980.37
##
## Number of Fisher Scoring iterations: 7
logitPlot <- data.frame(IncomePerSqFt = seq(min(housingBoro$IncomePerSqFt), max(housingBoro$IncomePerSqFt),len=nrow(housingBoro)))
logitPlot$Boro = predict(Boro2, newdata=logitPlot, type="response")
housingBoro %>% mutate(Boro = if_else(Boro == "Manhattan", 1, 0)) %>% # have to reclassify manhattan as 1
ggplot(aes(x=IncomePerSqFt, y=Boro)) + geom_jitter(shape = 1, height = 0.005, alpha = .25) +
geom_line(data = logitPlot, aes(x = IncomePerSqFt, y=Boro), colour = "red") +
theme_minimal()
Glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The regularization path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter lambda.
\[\text{min}_{\beta_0, \beta}\frac{1}{N}\sum ^N _{i=1}w_il(y_i,\beta_0+\beta^Tx_i) +\lambda[(1-\alpha)||\beta||_2^2/2+\alpha||\beta||_1]\]
If you recall the concepts of overfitting and parsimony, simplifying models as much as possible is beneficial. Regularization is a popular technique for avoiding overfitting; it adds a penalty term to the loss function.
\[L = \Sigma(\hat{Y_i}-Y_i)^2 + \lambda\Sigma\beta^2\]
This loss function has two parts: the first is the sum of squared errors, and the second sums the squared \(\beta\)s and multiplies them by a new term \(\lambda\). We want to penalize the model for large coefficient values. This is the Ridge penalty.
The Lasso loss function is slightly different from the previous method:
\[L = \Sigma(\hat{Y_i}-Y_i)^2 + \lambda\Sigma|\beta|\]
where you can see the \(\beta\) is now inside an absolute value. Unlike the Ridge method, where coefficients are pushed towards zero, the Lasso will set irrelevant coefficients exactly to zero.
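Here is a minimal glmnet sketch (assuming the glmnet package is installed; the chosen predictors are just for illustration). Setting alpha = 1 gives the Lasso and alpha = 0 gives Ridge, and cv.glmnet selects lambda by cross-validation.

```r
library(glmnet)

# glmnet wants a numeric matrix, so build one with model.matrix
X <- model.matrix(ValuePerSqFt ~ Units + SqFt + Income + Expense,
                  data = housing)[, -1]   # drop the intercept column
y <- housing$ValuePerSqFt

lassoCV <- cv.glmnet(X, y, alpha = 1)
coef(lassoCV, s = "lambda.1se")   # Lasso may set some coefficients exactly to zero

ridgeCV <- cv.glmnet(X, y, alpha = 0)
coef(ridgeCV, s = "lambda.1se")   # Ridge shrinks coefficients toward, but not to, zero
```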